H3AGWAS: a portable workflow for genome wide association studies

Background: Genome-wide association studies (GWAS) are a powerful method to detect associations between variants and phenotypes. A GWAS requires several complex computations on large data sets, and many steps may need to be repeated with varying parameters. Running these analyses manually is tedious, error-prone and hard to reproduce.
Results: The H3AGWAS workflow from the Pan-African Bioinformatics Network for H3Africa is a powerful, scalable and portable workflow implementing pre-association analysis, various association testing methods and post-association analysis of results.
Conclusions: The workflow is scalable, from laptop to cluster to cloud (e.g., SLURM, AWS Batch, Azure). All required software is containerised and can run under Docker or Singularity.
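As a minimal sketch of how such a run is launched (the repository path, entry point, profile names and input parameters here are illustrative and should be checked against the H3AGWAS documentation for the release in use):

    # run the workflow from its GitHub repository under Singularity on SLURM;
    # --input_dir/--input_pat point at a PLINK data set (hypothetical paths)
    nextflow run h3abionet/h3agwas -profile slurm,singularity \
        --input_dir /data/geno --input_pat mydata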


Computational comparison between the BIGwas and H3AGWAS workflows

Introduction
This section documents a computational comparison of the BIGwas and H3AGWAS workflows in response to the findings of Kässens et al. that BIGwas is significantly faster than the H3AGWAS workflow for QC on medium and larger files. For example, Kässens et al. found that a data set with 5k individuals and 50k SNPs takes 8 minutes (BIGwas) versus 15 minutes (H3AGWAS), that a data set with 20,554 individuals and 700k SNPs takes 135 minutes (BIGwas) versus 537 minutes (H3AGWAS), and that for even larger data sets they could not reasonably complete execution of the H3AGWAS workflow. (It must be emphasised that comparing workflows requires weighing multiple factors, and in our view performance is not the most important one; however, Kässens et al. have made very serious negative findings, which we believe need to be addressed. We do not agree with their findings.)

Data sets used
The following data sets were used:
• The example data set that comes with BIGwas: a set based on 1000 Genomes data with 2504 individuals and 50k SNPs.
• The AWI-Gen data set.
• A simulated data set (sim1) with 22,142 individuals and 2.267 million SNPs.
One difference between the two workflows is that the H3AGWAS workflow expects all the genotype data to be in one PLINK data set, while BIGwas allows multiple files to be input and then merged, which exposes further parallelism. To make the comparison fair, we compared the H3AGWAS workflow times with two separate runs of the BIGwas workflow:
• The BIGwas example data set is provided as two separate PLINK data sets, one of "cases" and one of "controls", each with approximately 2500 individuals. We compared the H3AGWAS workflow with the merged data set as input against BIGwas on both the merged data and the original split files.
• For the AWI-Gen data, we split the data into roughly two halves and artificially declared the two halves as cases and controls. We compared the H3AGWAS workflow on the original AWI-Gen data with BIGwas on the original data and on the split data. For the sim1 data set a similar comparison was made.

Computational setup
We used Nextflow 21.04.1 and Singularity with the images provided by the two tools. In both cases, the Nextflow repositories and the Singularity images were already downloaded and installed. We performed the experiments on a single machine (no scheduler) and using SLURM on a cluster.

Single machine execution
We used a machine with dual Xeon Silver 4214 CPUs running at 2.20GHz (24 physical cores, 48 hyper-threaded cores) and 128GB of RAM; all data inputs and outputs were stored on a Seagate ST2000NM0008 2TB SATA hard drive. The machine was otherwise unloaded. We ran both workflows on their default settings: presumably both sets of developers chose appropriate defaults, so this is fair. However, the H3AGWAS workflow has performance parameters that users with more powerful computers can set. The max_plink_cores parameter is by default set to 4: this limits any PLINK process to 4 cores (and reserves 4 cores, which count towards CPU hours whether used or not). If we run Nextflow with --max_plink_cores=12, the elapsed time for the AWI-Gen data set drops from 1383s to 750s, at the cost of the accounted CPU hours rising to 4.2. This is a trade-off the user must consider. No doubt similar changes could be made to the BIGwas workflow. Note that the difference between the AWI-Gen and sim1 data sets for the H3AGWAS workflow is a factor of 5.2. In principle we would expect the overall computational cost to scale quadratically with the number of SNPs, since the single biggest computational costs are steps which are quadratic; however, as these components take longer, task parallelism makes up a smaller proportion of the overall cost. The BIGwas workflow also scales super-quadratically, but by a greater factor (our superficial observation is that the bulk of this extra cost arises at similar points in the computation).

Table 1: Comparison between the H3AGWAS and BIGwas workflows using the QC script and 3 different data sets, with single-machine execution.
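As an illustration of this trade-off, a hedged sketch of overriding the parameter on the command line (only --max_plink_cores is taken from the text above; the entry point and input parameters are hypothetical):

    # allow each PLINK process up to 12 cores instead of the default 4
    nextflow run h3abionet/h3agwas -profile singularity \
        --input_dir /data/awigen --input_pat awigen \
        --max_plink_cores=12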

Cluster execution
We tested on our production University Research Cluster: SLURM 20.11.8, CentOS 7.9, Singularity 3.6.3 (default Singularity OSG release). The cluster is heterogeneous, so we set the Nextflow clusterOptions directive to execute only on 20 nodes with dual Intel Xeon Silver 4114 CPUs running at 2.2GHz (20 physical, 40 hyper-threaded cores per node; note that these machines are slower than the one tested above). Since this is a production cluster we were unable to test while the cluster was otherwise completely idle, but we could test late on a weekend with only a few other jobs running, so we do not think this affected the results (we manually checked that our jobs ran on machines not being used by other jobs).
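A minimal sketch of this kind of restriction (executor and clusterOptions are standard Nextflow process directives, which can also be set in nextflow.config; the node names are hypothetical):

    # restrict Nextflow's SLURM jobs to the chosen nodes for this run
    nextflow run h3abionet/h3agwas -profile singularity \
        -process.executor slurm \
        -process.clusterOptions '--nodelist=node[01-20]'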
Singularity issues: We were unable to run BIGwas on the cluster in its default Singularity setting. Like many production HPC systems, the cluster follows the recommendation not to allow Singularity with setuid enabled (https://sylabs.io/guides/latest/user-guide/security.html). The disadvantage of this is that standard Singularity SIF images cannot be executed directly but must be copied (unsquashed) to a temporary disk for each separate Nextflow process. The BIGwas image is 11GB in SIF (compressed) format, and the computational cost of unsquashing, especially when multiple processes do this in parallel, made running the workflow impractical. The same Singularity issue applies to the H3AGWAS workflow, of course. However, the H3AGWAS container design approach is to have several specialised containers rather than one monolithic container. For example, the workhorse py3plink container is ≈450MB in size, so although there is a significant penalty for running the H3AGWAS workflow on the cluster under Singularity compared to running natively installed software, it is not an outrageous penalty. In order to perform the testing below we were able to enable setuid for Singularity (this requires root privileges), but this is not something we are allowed to leave enabled for extended periods. In many environments setuid is enabled in Singularity installations (as it was in our testing on the single machine), so many users will not face this problem, but we suspect that other Singularity users in production HPC environments will run into the same problem as us. (We also tested Singularity 3.8 and 3.9, which had the same problem.)
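Users can check whether their installation is affected; a sketch, assuming a default source-installation path (the configuration file location varies by installation, and /etc/singularity is also common):

    # setuid must be enabled for SIF images to be executed directly
    grep -i "allow setuid" /usr/local/etc/singularity/singularity.conf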

Conclusion
As we have indicated, performance is not the primary measure of a workflow, because ultimately the costs depend on the underlying software used, and the workflow designer can take neither too much credit nor too much blame for this. However, it is important to demonstrate that the workflow can expose appropriate parallelism, which we believe has been demonstrated. Certainly we do not believe that the experimental evidence supports the claim that the BIGwas workflow is faster than the H3AGWAS workflow.

Description and testing of the different H3AGWAS workflow scripts: CPU and time usage
This section presents additional testing of the H3AGWAS workflow on our SLURM cluster. Again, since the cluster is a production cluster, we were only able to run on a lightly loaded cluster rather than an idle one. The purpose of this testing is to give indicative real-world costs. For each test, the workflow execution is shown graphically as a directed acyclic graph and the computational cost of the individual components is shown (of course, many of the individual components can run in parallel).

Quality Control of genetics data
• Objectives: apply quality control to genotype data.
• Input: the workflow takes as input a PLINK data set of genotype data, phenotypes and the sex phenotype.
• Individual filters:
  - sex check using the X chromosome and the sex phenotype
  - missingness
  - relatedness
• SNP filters:
  - minor allele frequency
  - heterozygosity
  - duplicated markers
  - missingness
• Output:
  - a report in PDF describing each step and the different filters
  - PLINK files after quality control, together with frequency distributions, Hardy-Weinberg equilibrium statistics and so on (see example 1)
  - intermediate files produced by the workflow.
• Test: the QC workflow was applied to 12,000 individuals genotyped at the H3Africa array positions (2.4 million positions), using the cluster and SLURM; see the statistics in Table 4.
An overview of the execution can be found in Figure 2 and the detailed computational results in Table 4.
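A hedged sketch of invoking the QC workflow with explicit cut-offs (the parameter names follow the H3AGWAS documentation but may differ between releases; the paths are hypothetical):

    # individual/SNP missingness, MAF and HWE thresholds set explicitly
    nextflow run h3abionet/h3agwas -profile slurm,singularity \
        --input_dir /data/h3array --input_pat chipdata \
        --cut_mind 0.05 --cut_geno 0.05 --cut_maf 0.01 --cut_hwe 0.0001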

Introduction
The input file for this analysis was KGPH3abionet.orig.{bed,bim,fam}. Note that some statistics are shown twice, on the raw input data and on the final result, since these statistics are needed for different purposes.

Approach
The pipeline takes an incremental approach to QC, trading extra computation time in order to achieve high quality while removing as little data as possible. Rather than applying all cut-offs at once, we apply them incrementally (for example, removing really badly genotyped SNPs before checking for heterozygosity results in fewer individuals failing the heterozygosity check), as sketched below.
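The following illustrates the incremental idea in raw PLINK commands (an illustration of the principle, not the workflow's exact steps; file names are hypothetical):

    # first drop badly genotyped SNPs, then compute per-individual
    # heterozygosity on the cleaner data, so fewer individuals fail
    # the heterozygosity check spuriously
    plink --bfile raw --geno 0.05 --make-bed --out step1
    plink --bfile step1 --het --out step1-het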

QC Phase 0
This phase only removes SNPs which are duplicated (based on SNP name). No other QC is done, so the output of this phase should still be considered raw data. 495 individuals had discordant sex information: the full PLINK report can be found in KGPH3abionet-nd.sexcheck, an extract of the PLINK report showing only the failed reports can be found in KGPH3abionet-nd.badsex, and a more detailed analysis can be found in Section 5. Figure 1 shows the spread of missingness per SNP across the sample, whereas Figure 2 shows the spread of missingness per individual across the sample. Note that this shows missingness before any filtering or cleaning of the data.
Minor allele frequency: Table 1 on page 2 shows the minor allele frequency spectrum for the raw data. The number of monomorphic SNPs is shown in the first row. Note that some of the SNPs with very low MAF are actually monomorphic, with the apparent polymorphism due to genotyping error. Figure 3 on page 4 shows the cumulative distribution of MAF; this can be used to determine an appropriate MAF cut-off. Note that the minor allele is determined with respect to the frequency spectrum in this data: 'minor' is not a synonym for the alternate or non-reference allele, or for the allele that has minor frequency in some other data set. Under this definition the MAF is always ≤ 0.5.
Hardy-Weinberg statistics: Figure 4 shows the cumulative distribution of Hardy-Weinberg p-values for the SNPs in the raw data. This can be used to assess the cost of excluding SNPs at a given HWE cut-off. We expect the curve to fit tightly to the main diagonal, except for very small p-values (and this deviation may not be observable on a linear plot). The QQ plot for the HWE scores can be found in Figure 5; the region of deviation of observed from expected p-values will be more observable there. Note that if there are very small observed p-values relative to the expected values, the expected curve may be very flat, so pay attention to the x and y axis coordinates. Since we are plotting on a negative log scale, note that regions of low probability of deviation from HWE (p-value close to 1) are at the left, and regions of high probability (low p-value) are at the right. The tail of the plot where deviation from the diagonal occurs is likely to be a good cut-off to use for QC.
However, care needs to be taken not to exclude SNPs unnecessarily. We are using the HWE p-value as a proxy for something having gone wrong with the sampling or genotyping, and this is a little crude. In a study with participants from different population groups, or in a recently admixed group, deviation from HWE is expected and does not indicate problems with QC. Moreover, in a disease study, among affected individuals the SNPs associated with the condition under study are likely not to be in HWE. Care needs to be taken; this is easier to handle in a pure case/control study, as sketched below. In a population cross-section study with different conditions being considered, it might be advisable to re-run the QC pipeline's HWE step for each study. The current version of the pipeline does not support this more complex analysis, though we plan to extend it.
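For the case/control situation, PLINK 1.9's --hwe filter is, for case/control data, computed on controls by default, which avoids discarding SNPs whose HWE deviation is driven by true association among cases. A sketch (file names and the threshold are illustrative):

    # exclude SNPs failing the HWE exact test in controls at p < 1e-6
    plink --bfile qcdata --hwe 1e-6 --make-bed --out qcdata-hwe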

Figure 3: Cumulative frequency of SNPs. For a frequency shown on the x-axis, the corresponding y-value shows the proportion of SNPs with frequency at least this value; that is, it shows the proportion of SNPs which will remain if this x-value is chosen as the MAF filter.

Figure 4: HWE distribution. For an HWE p-value shown on the x-axis, the corresponding y-value shows the proportion of SNPs with HWE p-value at least this value; that is, it shows the proportion of SNPs which will remain if this x-value is chosen as the HWE filter. The file is KGPH3abionet-nd-inithwe.pdf.

Table 4: Summary statistics of the QC workflow. Process: the Nextflow process name; Tot. hours: total hours used by the NF process; % time: percentage of time used by the process relative to the other processes; % CPU used (mean): mean CPU utilisation of the process; Max mem (MB): maximum memory (resident set size) used by any single process; NF processes: number of Nextflow processes used for the step.

Association
• Input:
  - a phenotype file with one or more phenotypes, and covariates
  - genetic data as PLINK files and, optionally, dosage data: BGEN (regenie, SAIGE, fastGWA and BOLT-LMM), VCF (SAIGE) and IMPUTE2 (BOLT-LMM).
• Output:
  - summary statistics for each software and phenotype used, and a PDF report containing, for each combination, Manhattan and QQ plots and the top results
  - relatedness computed for each software.
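A hedged sketch of an association run (the entry point, tool-selection flags and file names are illustrative and should be checked against the H3AGWAS documentation for the release in use):

    # test cholesterol with sex and age as covariates, using GEMMA
    nextflow run h3abionet/h3agwas/assoc/main.nf -profile slurm,singularity \
        --input_dir /data/qc --input_pat qcdata \
        --data pheno.csv --pheno cholesterol --covariates sex,age \
        --gemma 1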

Fine-mapping
• Objective: apply fine-mapping to significant regions of a summary statistics result.
• Input data: summary statistics, the number of causal variants, and genetic data in PLINK format.
• Steps:
  - identify the regions on which to apply fine-mapping from the full summary statistics, using PLINK clumping to identify lead SNPs (see the clumping sketch below)
  - for each region, apply different algorithms to find the number of independent SNPs, putative causal variants and credible sets, using different fine-mapping software: COJO (a step-wise model selection procedure to select independently associated SNPs), FINEMAP (a stochastic and conditional algorithm), CAVIARBF, and PAINTOR (with the possibility of using eQTL information).
• Output:
  - intermediate files produced by the workflow and by each software
  - a file containing all results merged
  - locus-zoom style figures annotated with the posterior probability from each fine-mapping software.
• Test: fine-mapping was applied to the cholesterol summary statistics obtained from association testing with GEMMA on the AWI-Gen data set. We obtained 40 windows with p < 5 × 10⁻⁸, and for each region the different fine-mapping software was run. Figure 5 gives an overview of the execution.

Table 5: Summary statistics of the fine-mapping workflow run on the cluster, using the cholesterol phenotype. Process: the Nextflow process name; Tot. hours: total hours used by the NF process; % time: percentage of time used by the process relative to the other processes; % CPU used (mean): mean CPU utilisation of the process; Max mem (MB): maximum memory used by any single process; NF processes: number of Nextflow processes used for the step.
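A sketch of the lead-SNP identification step using raw PLINK clumping (the thresholds are illustrative, not necessarily the workflow's defaults):

    # group SNPs with p < 5e-8 into LD-based clumps around lead SNPs
    plink --bfile geno --clump summary.assoc \
          --clump-p1 5e-8 --clump-r2 0.1 --clump-kb 250 --out leads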
This report gives a brief overview of a run of the association testing pipeline.
• You were testing for the following phenotypes: pheno_qt2
• You were using the following covariates: []

Principal Component Analysis of Participants
A summary of the data for pheno_qt2 can be found in Table 1, transformed using different transforms. A histogram is shown in Figure 2.

Result of GEMMA analysis: phenotype pheno-qt2

All the results from the GEMMA analysis can be found in the gemma directory. The result of the GEMMA analysis is shown for phenotype pheno-qt2. The file with association statistics is imput_data-pheno-qt2.assoc.txt. The top 10 results are shown in Table 2. The Manhattan plot can be found in Figure 3, and the corresponding QQ plot in Figure 4.

Figure 4: Example of a report generated by the fine-mapping pipeline. Page 1: locus zoom with lead SNPs defined using the step-wise model selection procedure (GCTA), credible positions from the fine-mapping software, posterior probabilities from each fine-mapping software, information from the GWAS Catalog, and genes. Page 2: legends. Page 3: GWAS Catalog information.

Heritability
• Objectives: compute heritability and/or co-heritability using genetic diversity and phenotypes, and/or summary statistics.
• Input: genetic data and phenotypes, and/or summary statistics.
• Steps:
  - format and prepare files
  - build the relatedness matrix or genetic relationship matrix for GCTA and GEMMA
  - compute heritability using GEMMA and LDSC from summary statistics, and using GCTA, GEMMA and BOLT-LMM from genotypes and phenotypes (see the sketch below).
• Test: using 4 lipid phenotypes, genotypes, and the summary statistics obtained from the association testing results, we ran heritability and co-heritability estimation on the cluster.
• Output: the pipeline gives all intermediate files from each software, as well as a bar plot of the heritability estimates (see Figure 6). An overview is shown in Figure 7 and the detailed computational cost in Table 6.

Table 6: Summary statistics of the heritability workflow run on the cluster, using 4 phenotypes, 10,000 individuals and the corresponding summary statistics. Process: the Nextflow process name; Tot. hours: total hours used by the NF process; % time: percentage of time used by the process relative to the other processes; % CPU used (mean): mean CPU utilisation of the process; Max mem (MB): maximum memory (resident set size) used by any single process; NF processes: number of Nextflow processes used for the step.
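A sketch of the underlying GCTA steps (not the workflow's exact invocation; file names are hypothetical):

    # build a genetic relationship matrix, then estimate heritability by REML
    gcta64 --bfile qcdata --make-grm --out grm
    gcta64 --grm grm --pheno lipids.txt --mpheno 1 --reml --out h2_lipid1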

Simulations workflow
• Steps:
  - downloads the GWAS Catalog
  - extracts and formats the GWAS Catalog file, extracting positions and effects using a list of SNPs
  - downloads genomic data for the positions extracted from the GWAS Catalog and for the array
  - extracts independent positions from the GWAS Catalog positions, using PLINK's "--clump" and the downloaded genetic data
  - uses the genomic data and the z-values extracted from the GWAS Catalog at the independent positions to build phenotypes using GCTA (see the sketch below).
• Output:
  - genotypes in PLINK format for the array positions defined in the input
  - quantitative and qualitative phenotypes, together with the positions and genotypes used to build them and the related GWAS Catalog information.
• Test: phenotypes were built using data from the 1000 Genomes Project and the GWAS Catalog, with diabetes as the phenotype. See Table 8 for the detailed computational costs and Figure 8 for an overview of the different steps.
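A sketch of the phenotype-building step using GCTA's simulation mode (file names and the heritability value are illustrative; --simu-cc would produce a qualitative phenotype instead):

    # simulate a quantitative phenotype from a causal-SNP list with effects
    gcta64 --bfile kg_geno --simu-qt \
           --simu-causal-loci causal_snps_with_effects.txt \
           --simu-hsq 0.5 --out sim_pheno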

Convert PLINK to VCF
• Steps:
  - splits the file by chromosome (optional)
  - converts PLINK to VCF
  - cleans and renames position names
  - fixes alleles using BCFtools (see the sketch below).
• Output: a VCF file, and VCF files by chromosome.
• Test: data for 10,796 individuals after QC were converted to VCF format using the workflow. See Figure 9 for an overview of the process and Table 9 for the detailed computational costs.

Table 9: Summary of the workflow to convert PLINK files to VCF to prepare data for imputation. Process: the Nextflow process name; Tot. hours: total hours used by the NF process; % time: percentage of time used by the process relative to the other processes; % CPU used (mean): mean CPU utilisation of the process; Max mem (MB): maximum memory (resident set size) used by any single process; NF processes: number of Nextflow processes used for the step.
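A sketch of the per-chromosome conversion and allele fixing (the bcftools +fixref options and the reference file are illustrative):

    # convert one chromosome to bgzipped VCF, then fix alleles against
    # the reference genome, flipping strand where needed
    plink --bfile qcdata --chr 22 --recode vcf-iid bgz --out chr22
    bcftools +fixref chr22.vcf.gz -Oz -o chr22-fixed.vcf.gz -- \
             -f hs37d5.fa -m flip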
Convert VCF to PLINK or other formats
• Objective: convert the output of imputation to PLINK or another format in order to run association testing.
• Input: a list of VCF files and a genetic map.
• Steps:
  - computes various statistics, such as frequencies and imputation scores
  - filters the VCFs by score and frequency
  - transforms the files to PLINK format
  - checks for duplicate rs identifiers and corrects them
  - merges all the per-chromosome PLINK files (see the sketch below).

Figure 9: Flowchart of the workflow to convert PLINK format to VCF to prepare data for imputation.

Figure 11: Flowchart of the workflow to convert VCF files from imputation into PLINK format.

Table 10: Summary of the workflow to convert VCF files after imputation into PLINK format, using imputed data derived from the QC output. Process: the Nextflow process name; Tot. hours: total hours used by the NF process; % time: percentage of time used by the process relative to the other processes; % CPU used (mean): mean CPU utilisation of the process; Max mem (MB): maximum memory used by any single process; NF processes: number of Nextflow processes used for the step.
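A sketch of the filter-convert-merge steps (the INFO field carrying the imputation score, here R2, depends on the imputation server; file names are hypothetical):

    # keep well-imputed variants, convert to PLINK, then merge chromosomes
    bcftools view -i 'INFO/R2>0.3' chr22.dose.vcf.gz -Oz -o chr22-filt.vcf.gz
    plink --vcf chr22-filt.vcf.gz --make-bed --out chr22
    plink --bfile chr1 --merge-list other_chromosomes.txt --make-bed --out all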

Multi-trait analysis using MTAG
• Objective: analyse multiple traits using summary statistics.
• Input: summary statistics results from various phenotypes.

• Steps:
  - formats each file to prepare the input for the MTAG software
  - runs MTAG with all summary statistics
  - runs MTAG on each pair of summary statistics.
• Output: the results of the MTAG software (summary statistics), and a PDF report as in the association pipeline.
• Test: we used output summary statistics (14 million SNPs) for 4 phenotypes. An overview is shown in Figure 12 and the detailed computational costs in Table 11.

Table 11: Summary of the workflow performing a multi-trait analysis using MTAG. Process: the Nextflow process name; Tot. hours: total hours used by the NF process; % time: percentage of time used by the process relative to the other processes; % CPU used (mean): mean CPU utilisation of the process; Max mem (MB): maximum memory used by any single process; NF processes: number of Nextflow processes used for the step.

Table 13: Cost of running the meta-analysis workflow on the Wits cluster, using 3 summary statistics with 14 million positions per summary statistic. Column labels as in Table 12.